[SPARK-17556][SQL] Executor side broadcast for broadcast joins #15178
Closed
viirya wants to merge 12 commits into
What changes were proposed in this pull request?
Spark currently broadcasts an RDD by first collecting its result to the driver and then broadcasting it from there. This round trip through the driver introduces extra latency. The RDD can instead be broadcast directly from the executors. This patch implements executor-side broadcast and applies it to broadcast joins in Spark SQL.
The advantages of executor-side broadcast:
Design document: https://issues.apache.org/jira/secure/attachment/12831201/executor-side-broadcast.pdf
Major API changes
New API
broadcastRDDOnExecutor in SparkContext
It takes two parameters, rdd: RDD[T] and mode: BroadcastMode[T]. It broadcasts the content of the RDD between executors without collecting it back to the driver. The mode parameter is used to convert the content of the RDD to the broadcast object.
Besides T, this API has another type parameter U, which is the type of the converted object.
New Broadcast implementation TorrentExecutorBroadcast
Unlike TorrentBroadcast, this implementation does not divide and store the object data to be broadcast in the driver. The executors use local and remote fetches to fetch the blocks of the RDD and convert the RDD content to the broadcast object.
BroadcastMode is moved from org.apache.spark.sql.catalyst.plans.physical to org.apache.spark.broadcast
It now has a type parameter T, which is the converted type of the broadcast object on executors.
Usage: How to use executor side broadcast
To broadcast the result of an RDD, instead of collecting the result back to the driver and broadcasting it from there, we can use the executor-side broadcast feature proposed in this patch.
Prepare the RDD to be broadcast
Define how to transform the result of the RDD with BroadcastMode
Broadcast the RDD and use the broadcast variable
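The three steps above can be illustrated with a toy model in plain Scala. This is only a sketch of the idea, not the API added by this patch: ToyBroadcastMode, ToMapMode, driverSideBroadcast and executorSideBroadcast are hypothetical names invented for illustration, partitions are modelled as nested Seqs rather than a real RDD, and no Spark classes are involved.

```scala
// Toy model only: none of these types are Spark classes.
object ExecutorSideBroadcastSketch {

  // Hypothetical stand-in for a broadcast mode: it converts the raw rows
  // of the RDD into the object that ends up being broadcast.
  trait ToyBroadcastMode[T, U] {
    def transform(rows: Seq[T]): U
  }

  // Step 2: define the transformation. Here, (key, value) rows become a lookup map.
  object ToMapMode extends ToyBroadcastMode[(Int, String), Map[Int, String]] {
    def transform(rows: Seq[(Int, String)]): Map[Int, String] = rows.toMap
  }

  // Driver-side path: gather every partition in one place, convert once,
  // then hand the same converted object to each executor.
  def driverSideBroadcast[T, U](
      partitions: Seq[Seq[T]],
      mode: ToyBroadcastMode[T, U],
      numExecutors: Int): Seq[U] = {
    val collected = partitions.flatten // models collecting to the driver
    val converted = mode.transform(collected)
    Seq.fill(numExecutors)(converted)
  }

  // Executor-side path: each executor gathers the partition blocks itself
  // (local and remote fetches in the described design) and converts locally.
  def executorSideBroadcast[T, U](
      partitions: Seq[Seq[T]],
      mode: ToyBroadcastMode[T, U],
      numExecutors: Int): Seq[U] =
    (0 until numExecutors).map { _ =>
      val fetchedBlocks = partitions.flatten // models fetching all RDD blocks
      mode.transform(fetchedBlocks)
    }

  def main(args: Array[String]): Unit = {
    // Step 1: prepare the data to be broadcast (two partitions).
    val partitions = Seq(Seq(1 -> "a", 2 -> "b"), Seq(3 -> "c"))

    // Step 3: broadcast and use the resulting value on each executor.
    val viaDriver = driverSideBroadcast(partitions, ToMapMode, numExecutors = 2)
    val viaExecutors = executorSideBroadcast(partitions, ToMapMode, numExecutors = 2)

    // Both paths leave every executor holding the same lookup table.
    assert(viaDriver == viaExecutors)
    println(viaExecutors.head(2))
  }
}
```

In this model the two paths produce the same value on every executor; the difference is only where the rows are gathered and converted, which is the latency the patch aims to remove from the driver.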
How was this patch tested?
Jenkins tests.